latent concept
- North America > United States > West Virginia (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Puerto Rico (0.04)
- (11 more...)
- Health & Medicine (0.93)
- Media > Film (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (1.00)
- (2 more...)
On the Convergence of Moral Self-Correction in Large Language Models
Liu, Guangliang, Mao, Haitao, Cao, Bochuan, Xue, Zhiyu, Zhang, Xitong, Wang, Rongrong, Johnson, Kristen Marie
Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction, performance convergence through multi-round interactions, and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits the desirable property of converged performance.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Michigan (0.04)
- North America > United States > Pennsylvania (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (3 more...)
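The multi-round procedure described in the Liu et al. abstract above is easy to operationalize. Below is a minimal sketch, assuming a hypothetical `generate(prompt)` helper standing in for any chat LLM API: the same abstract self-correction instruction is re-injected every round, and the loop halts once consecutive responses stabilize, mirroring the convergence behavior the paper studies.

```python
# Minimal sketch of multi-round intrinsic (moral) self-correction.
# `generate` is a hypothetical stand-in for any chat-completion API.
from difflib import SequenceMatcher

CORRECTION_INSTRUCTION = (
    "Please review your previous answer and make sure it is fair, "
    "unbiased, and morally sound. Provide an improved answer."
)

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM call here")

def self_correct(question: str, max_rounds: int = 5, tol: float = 0.95) -> list[str]:
    """Re-inject the same abstract instruction each round; stop once
    consecutive responses are near-identical, i.e. converged."""
    responses = [generate(question)]
    for _ in range(max_rounds):
        prompt = (
            f"Question: {question}\n"
            f"Your previous answer: {responses[-1]}\n"
            f"{CORRECTION_INSTRUCTION}"
        )
        responses.append(generate(prompt))
        # crude convergence check: string similarity of the last two rounds
        if SequenceMatcher(None, responses[-2], responses[-1]).ratio() >= tol:
            break
    return responses
```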
Analyzing Latent Concepts in Code Language Models
Sharma, Arushi, Pungliya, Vedant, Quinn, Christopher J., Jannesari, Ali
Interpreting the internal behavior of large language models trained on code remains a critical challenge, particularly for applications demanding trust, transparency, and semantic robustness. We propose Code Concept Analysis (CoCoA): a global post-hoc interpretability framework that uncovers emergent lexical, syntactic, and semantic structures in a code language model's representation space by clustering contextualized token embeddings into human-interpretable concept groups. We introduce a hybrid annotation pipeline that combines static-analysis-based syntactic alignment with prompt-engineered large language models (LLMs), enabling scalable labeling of latent concepts across abstraction levels. We analyze the distribution of concepts across layers and across three fine-tuning tasks. Emergent concept clusters can help identify unexpected latent interactions and be used to identify trends and biases within the model's learned representations. We further integrate CoCoA with local attribution methods to produce concept-grounded explanations, improving the coherence and interpretability of token-level saliency. Empirical evaluations across multiple models and tasks show that CoCoA discovers concepts that remain stable under semantic-preserving perturbations (average Cluster Sensitivity Index, CSI = 0.288) and evolve predictably with fine-tuning. In a user study on the programming-language classification task, concept-augmented explanations disambiguated token roles and improved human-centric explainability by 37 percentage points compared with token-level attributions using Integrated Gradients.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Iowa (0.04)
- North America > United States > California > Sacramento County > Sacramento (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
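As a rough illustration of the clustering step at the core of a framework like CoCoA (a sketch under assumptions, not the authors' actual pipeline), the snippet below groups contextualized token embeddings with k-means and surfaces each cluster's most frequent tokens as a crude, human-inspectable concept label; extracting `embeddings` and `tokens` from a code model's hidden states is assumed to have happened upstream.

```python
# Sketch: cluster contextualized token embeddings into candidate
# "concept" groups, then inspect each cluster's dominant tokens.
# `embeddings` (n_tokens x d) and `tokens` are assumed to have been
# extracted from a code language model's hidden states beforehand.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

def concept_clusters(embeddings: np.ndarray, tokens: list[str],
                     n_concepts: int = 50, top_k: int = 5):
    km = KMeans(n_clusters=n_concepts, n_init=10, random_state=0)
    labels = km.fit_predict(embeddings)
    clusters = {}
    for cid in range(n_concepts):
        members = [t for t, lab in zip(tokens, labels) if lab == cid]
        # the most frequent tokens act as a crude, inspectable label;
        # a CoCoA-style pipeline would refine this with static analysis
        # and LLM-generated annotations
        clusters[cid] = Counter(members).most_common(top_k)
    return labels, clusters
```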
Nonparametric Identification of Latent Concepts
Zheng, Yujia, Xie, Shaoan, Zhang, Kun
We are born with the ability to learn concepts by comparing diverse observations. This helps us understand the world in a compositional manner and facilitates extrapolation, as objects naturally consist of multiple concepts. In this work, we argue that the cognitive mechanism of comparison, fundamental to human learning, is also vital for machines to recover the true concepts underlying the data. This offers correctness guarantees for the field of concept learning, which, despite its impressive empirical successes, still lacks general theoretical support. Specifically, we aim to develop a theoretical framework for the identifiability of concepts with multiple classes of observations. We show that with sufficient diversity across classes, hidden concepts can be identified without assuming specific concept types, functional relations, or parametric generative models. Interestingly, even when conditions are not globally satisfied, we can still provide alternative guarantees for as many concepts as possible based on local comparisons, thereby extending the applicability of our theory to more flexible scenarios. Moreover, the hidden structure between classes and concepts can also be identified nonparametrically. We validate our theoretical results in both synthetic and real-world settings.
- Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
- North America > United States > California (0.04)
- North America > Canada (0.04)
- Asia > Middle East > Jordan (0.04)
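To make the identifiability claim concrete, here is a schematic of the kind of statement involved, in assumed notation rather than the paper's exact theorem: observations arise from hidden concepts through an unknown nonparametric map, classes reshape only the concept distribution, and sufficient diversity across classes pins the concepts down up to benign indeterminacies.

```latex
% Assumed setup: hidden concepts c generate observations x through an
% unknown map g; each class u changes only the concept distribution.
\[
  x = g(c), \qquad c \sim p(c \mid u), \qquad u \in \{1, \dots, U\}.
\]
% Schematic conclusion: with sufficient diversity of p(c | u) across
% classes, any alternative (g', c') matching all class-conditional
% distributions of x recovers the concepts up to a permutation \pi and
% invertible component-wise maps h_i:
\[
  \hat{c}_i = h_i\bigl(c_{\pi(i)}\bigr), \qquad i = 1, \dots, n.
\]
```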
Latent Concept Disentanglement in Transformer-based Language Models
Hong, Guan Zhe, Vasudeva, Bhavya, Sharan, Vatsal, Rashtchian, Cyrus, Raghavan, Prabhakar, Panigrahy, Rina
When large language models (LLMs) use in-context learning (ICL) to solve a new task, they must infer latent concepts from demonstration examples. This raises the question of whether and how transformers represent latent structures as part of their computation. We study this question with mechanistic interpretability experiments on several controlled tasks. First, we show that in transitive reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and performs step-by-step concept composition, building upon prior work that analyzes single-step reasoning. Then, we consider tasks parameterized by a latent numerical concept. We discover low-dimensional subspaces in the model's representation space, where the geometry cleanly reflects the underlying parameterization. Overall, we show that small and large models can indeed disentangle and utilize latent concepts that they learn in-context from a handful of abbreviated demonstrations.
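One simple way to probe for the low-dimensional subspaces the abstract describes (an assumed recipe, not the authors' exact method) is to collect the model's hidden representation for many prompts whose latent numerical parameter is known, then check how the top principal directions correlate with that parameter:

```python
# Sketch: probe for a low-dimensional subspace whose geometry tracks a
# latent numerical task parameter. `hidden_states[i]` is the model's
# d-dimensional representation for a prompt built with parameter
# theta[i]; collecting both is assumed to have happened beforehand.
import numpy as np

def parameter_subspace(hidden_states: np.ndarray, theta: np.ndarray,
                       n_components: int = 2):
    """PCA on the representations, then correlate each principal
    direction with the latent parameter."""
    X = hidden_states - hidden_states.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # principal axes
    components = vt[:n_components]             # (n_components, d)
    projections = X @ components.T              # (n_samples, n_components)
    corrs = [float(np.corrcoef(projections[:, j], theta)[0, 1])
             for j in range(n_components)]
    return components, projections, corrs
```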
Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval
Park, Seongwan, Kim, Taeklim, Ko, Youngjoong
Despite their strong performance, Dense Passage Retrieval (DPR) models suffer from a lack of interpretability. In this work, we propose a novel interpretability framework that leverages Sparse Autoencoders (SAEs) to decompose previously uninterpretable dense embeddings from DPR models into distinct, interpretable latent concepts. We generate natural language descriptions for each latent concept, enabling human interpretations of both the dense embeddings and the query-document similarity scores of DPR models. We further introduce Concept-Level Sparse Retrieval (CL-SR), a retrieval framework that directly utilizes the extracted latent concepts as indexing units. CL-SR effectively combines the semantic expressiveness of dense embeddings with the transparency and efficiency of sparse representations. We show that CL-SR achieves high index-space and computational efficiency while maintaining robust performance across vocabulary and semantic mismatches.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (3 more...)
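The core mechanics of the Park et al. approach above are straightforward to sketch. Below is a minimal, assumption-laden version of the idea: a sparse autoencoder decomposes a dense document embedding into non-negative latents, and the strongest latents become discrete indexing units in the spirit of CL-SR. Training would add a reconstruction loss plus a sparsity penalty, both omitted here; the architecture and `top_k` are illustrative choices, not the paper's.

```python
# Sketch: a sparse autoencoder over dense retrieval embeddings whose
# active latents double as discrete indexing units, in the spirit of
# CL-SR. Dimensions and top_k are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_embed, d_latent)
        self.decoder = nn.Linear(d_latent, d_embed)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))  # non-negative, sparsifiable code
        return self.decoder(z), z

def concept_index(embedding: torch.Tensor, sae: SparseAutoencoder,
                  top_k: int = 32) -> set[int]:
    """Keep only the strongest latents as a document's index terms."""
    with torch.no_grad():
        _, z = sae(embedding)
    return set(torch.topk(z, top_k).indices.tolist())
```

Retrieval can then score a query against a document by the overlap of their active latent sets, exactly as a sparse inverted index would.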
Addressing Leakage in Concept Bottleneck Models
Concept bottleneck models (CBMs) (Chen et al., 2020; Koh et al., 2020; Yeh et al., 2020; Lage & Doshi-Velez, 2020; Wang et al., 2017) propose explicitly aligning the intermediate layers of a neural network with human-interpretable concepts. The concept bottleneck model (CBM) makes its prediction in two stages: it first predicts a set of concepts from the input, then predicts the label from those predicted concepts alone. While soft concepts may improve predictive performance, this improvement comes at a cost: unintended information about the label can leak through the soft concept probabilities, undermining the interpretability the bottleneck is meant to provide.
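A minimal sketch of the two-stage structure follows, with a switch between soft and hard concepts (illustrative architecture, not the paper's): with `hard=False` the label head sees full concept probabilities, which is the channel through which leakage occurs; with `hard=True` concepts are thresholded and the channel narrows.

```python
# Sketch of the two-stage concept bottleneck model (illustrative
# architecture): stage 1 predicts concepts from the input, stage 2
# predicts the label from the concepts alone. Soft concept
# probabilities let extra, unintended information reach stage 2,
# which is the leakage at issue.
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    def __init__(self, d_in: int, n_concepts: int, n_classes: int,
                 hard: bool = False):
        super().__init__()
        self.concept_net = nn.Sequential(
            nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, n_concepts))
        self.label_net = nn.Linear(n_concepts, n_classes)
        self.hard = hard

    def forward(self, x: torch.Tensor):
        c_prob = torch.sigmoid(self.concept_net(x))      # stage 1
        # hard concepts narrow the leakage channel, but thresholding is
        # non-differentiable, so end-to-end training needs a workaround
        c = (c_prob > 0.5).float() if self.hard else c_prob
        return self.label_net(c), c_prob                 # stage 2
```

In practice the two stages are trained independently or jointly; with hard concepts, the non-differentiable threshold forces stage-wise objectives or a straight-through estimator.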